Abstract

The performance of EM in learning mixtures of product distributions often depends on the 
initialization. This can be problematic in crowdsourcing and other applications, e.g. when 
a small number of “experts” are diluted by a large number of noisy, unreliable participants. 
We develop a new EM algorithm that is driven by these experts. In a manner that differs 
from other approaches, we start from a single mixture class. The algorithm then develops 
the set of experts in a stagewise fashion based on a mutual information criterion. At each 
stage EM operates on this subset of the players, effectively regularizing the E rather than the 
M step. Experiments show that stagewise EM outperforms other initialization techniques for 
crowdsourcing and neurosciences applications, and can guide a full EM to results comparable 
to those obtained knowing the exact distribution.


1 Introduction 

We study the model-based sparse clustering problem for discrete data using a mixture model of 
product distributions [9, 7]. This model has application in many fields, including computational 
neurosciences, crowdsourcing and bioinformatics, and is interesting because it differs technically 
from the problem for continuous data, where the well-known Gaussian mixture model has been 
applied successfully. 

A fundamental difficulty is that, in high-dimensional datasets, some features can be noisy,
redundant or generally uninformative for clustering, and these can push clustering algorithms toward
inappropriate or uninteresting results. If these uninformative or noise data points could be
eliminated then, we argue, the results should be much more satisfying. This is precisely our goal: to
find an informative set of data points and to use these to drive the clustering.

We illustrate our goal with a motivating example from neurosciences (Figure 1). Some neurons 
in mouse visual cortex respond well to certain grating orientations, while others do not respond 
systematically to gratings. This is called orientation selectivity [11], and we seek to organize neurons 
into orientation classes according to this activity automatically. The dataset consists of multiple 
neural spike trains obtained while the mouse viewed gratings with different orientations. The spike 
trains are converted into binary data indicating the presence (or absence) of an action potential 
during a short temporal interval. The problem is: given only the spike train data, are we able to 
group neurons into clusters that correspond to each orientation-tuned class? If we train a mixture 
model with all neurons using EM, the answer is negative; many of the neurons that are not well-tuned
to orientation pollute the results. However, using only the informative neurons (cartooned
as those with well-defined tuning curves) we are able to recover the clusters successfully, which 
underscores our goal to automatically identify those informative and relevant features for discrete 
data. The algorithms used for this example are developed in this paper, and applied at the end to 
crowdsourcing data as well. 

Figure 1. Bernoulli mixture model learned from multiple neuron spike train data in mouse visual
cortex illustrates our goal. Mice view gratings at one of 12 different orientations and an electrode 
records from multiple units. The gray box illustrates the given data. Some units are selective to the 
gratings, and some are not (uninformative neurons). If all neurons are used to learn a mixture model, 
the classes are ill-defined and the orientation tuning curves for each class are uniform (D). (A) shows 
a mixture model learned by the algorithm developed in this paper, which works by identifying the 
informative neurons and regularizing EM. In the cartoon these neurons correspond to those that are 
well tuned, as indicated by the solid box in (C). We emphasize that these tuning curves were not 
used by our algorithm but were derived from the classes it computed. Data courtesy M. Stryker, 
University of California at San Francisco. 


A similar problem for continuous data - the Gaussian mixture model with sparse means - is 
better studied. In particular, [12] proposed an algorithm based on a penalized likelihood function 
that leads to an EM variant with a regularized M-step, and [1] analyzes learning for a mixture of 
two isotropic Gaussians in high dimensions under sparse mean separation. As in these papers, we 
also consider a penalized likelihood function but for a mixture of discrete product distributions. 
However, we differ from them in that we directly regularize the E-step rather than the M-step. 

To regularize the E-step, we define an information-theoretic quantity and use it in a novel, 
stagewise fashion. Our measure is the sum of pairwise conditional mutual information of a certain 
hybrid distribution, defined below, which turns out to be closely related to maximum likelihood 
estimation and the EM algorithm. A similar idea appears in [17], who use the Euclidean distance 
between pairs of data points to regularize a K-means algorithm for sparse clustering. However, 
in our approach we select which variables to place into the informative set in a stagewise fashion. 
This stagewise technique is important because many researchers have pointed out the drawbacks of using
dimensionality reduction before clustering [3, 2, 4]. Importantly, as is also explained below, this
involves starting with a single class and splitting it into multiple classes. We stress that our 
informative set is conceptually different from maximally informative dimensions [16]. 

The paper is organized as follows. Section 2 briefly introduces the problem setting for learning
mixtures of discrete product distributions with sparse structure, and our specific algorithm is presented
in Section 3. In Section 4, we apply our algorithm to crowdsourcing data to show its range of applicability
beyond neurosciences, illustrate our information-theoretic measure, and compare the algorithm with other
state-of-the-art algorithms.


2 Background 


Throughout the paper, we use $[a]$ to denote the integer set $\{1, 2, \ldots, a\}$. In a mixture of discrete
product distributions (MDPD), it is assumed that each observation $x$ is drawn from a finite
mixture distribution $f(X) = \sum_{k=1}^{K} \omega_k f(X|Y = k; \mu_k)$. $Y \in [K]$ is the latent (non-observable)
variable and $\omega_k = f(Y = k)$ denote the mixing weights; they satisfy $\sum_k \omega_k = 1$. We assume
$X_i \in [R]$ and $f(X|Y; \mu_k)$ is an $M$-dimensional discrete product distribution that can be factorized
as $f(X|Y; \mu_k) = \prod_{i=1}^{M} f(X_i|Y; \mu_{ki})$. The conditional distribution, parametrized as $\mu_{kir} = f(X_i = r|Y = k)$
and $\mu_{ki} = [\mu_{ki1}, \ldots, \mu_{kiR}]$, lives on the probability simplex. The set of all parameters is
denoted by $\Theta = \{(\omega_k, \mu_k) : k \in [K]\}$. Given $N$ observations $\{x^{(1)}, \ldots, x^{(N)}\}$, the goal of mixture
model learning is to maximize the marginal log-likelihood:

$$l(\Theta) = \sum_X f_0(X) \log \sum_Y f(X, Y; \Theta) = \frac{1}{N} \sum_{n \in [N]} \log \sum_k f(x^{(n)}, k; \Theta). \quad (1)$$

Since there are latent variables, the marginal log-likelihood is not concave, and EM has been used
widely for learning mixture models. EM iteratively updates and optimizes a lower bound of the
marginal log-likelihood function [10]. The lower bound is obtained by applying Jensen’s inequality to
the log-likelihood function:

$$l(\Theta) \geq \sum_X f_0(X) \sum_Y q(Y; X, \Theta) \log \frac{f(X, Y; \Theta)}{q(Y; X, \Theta)} \quad (2)$$

where $q(Y; X, \Theta)$ is a distribution over $Y$ that may depend on $X$ or $\Theta$. (We shall work on upper
bounds shortly.) Let the current model be parametrized by $\Theta^t$. Then

E-step: Calculate $f(y|x^{(n)}; \Theta^t)$ for $n \in [N]$ and set $q(k; x^{(n)}, \Theta) = f(y = k|x^{(n)}; \Theta^t)$.

M-step: Maximize (2) with regard to $\Theta$:

$$\Theta^{t+1} = \arg\max_\Theta Q(\Theta; \Theta^t)$$

where

$$Q(\Theta; \Theta^t) = \sum_X f_0(X) \sum_Y f(Y|X; \Theta^t) \log \frac{f(X, Y; \Theta)}{f(Y|X; \Theta^t)}. \quad (3)$$
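The relation between (1) and (2) can be checked numerically. The sketch below is our own illustration, not code from the experiments; the function names and the nested-list layout `mu[k][i][r]` are assumptions made for the example. It evaluates the empirical log-likelihood and the Jensen lower bound for an arbitrary $q$, and shows that the bound becomes tight when $q$ is the posterior, which is what the E-step chooses.

```python
import math

def joint(x, weights, mu):
    """f(x, Y = k; Theta) for every component k of an MDPD.

    Assumed layout: mu[k][i][r] = f(X_i = r | Y = k), labels are 0-based.
    """
    return [w * math.prod(mu_k[i][xi] for i, xi in enumerate(x))
            for w, mu_k in zip(weights, mu)]

def log_likelihood(data, weights, mu):
    """Empirical version of (1): average over samples of log sum_k f(x, k)."""
    return sum(math.log(sum(joint(x, weights, mu))) for x in data) / len(data)

def lower_bound(data, weights, mu, q):
    """Empirical version of (2); q[n] is any distribution over components.

    Assumes f(x, k) > 0 wherever q[n][k] > 0.
    """
    total = 0.0
    for x, qn in zip(data, q):
        total += sum(qk * math.log(jk / qk)
                     for qk, jk in zip(qn, joint(x, weights, mu)) if qk > 0)
    return total / len(data)

def posterior(x, weights, mu):
    """E-step quantity f(Y = k | x; Theta)."""
    j = joint(x, weights, mu)
    z = sum(j)
    return [v / z for v in j]
```

With a uniform `q` the bound is loose; with `q[n] = posterior(x_n, ...)` it coincides with the log-likelihood up to rounding.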

We study MDPD in a high-dimensional, sparse setting. (The analogous problem for Gaussian
mixture models has been studied by [12, 1, 14].) In this setting, the number of informative variables
is much smaller than the dimension $M$. Let $S$ denote the set of informative variables, $|S| \ll M$,
and let $\bar{S} = \{i \in [M] \mid i \notin S\}$ be the complementary set. It is intuitive that the uninformative
random variables $X_i$, $i \in \bar{S}$, should not be distinguishable across the different mixture components,
i.e. $f_0(X_i|Y = k_1) = f_0(X_i|Y = k_2)$ for $k_1, k_2 \in [K]$.

Inspired by [12], we consider the following penalized maximum likelihood problem to encourage
the sparse structure in the model:

$$\max_\Theta \; l(\Theta) - \lambda \sum_i \Big\| \sum_k D_{KL}(\bar{\mu}_i \| \mu_{ki}) \Big\|_0 \quad (4)$$

where $\bar{\mu}_i = \sum_k \omega_k \mu_{ki}$ and $\|\cdot\|_0$ is the $\ell_0$ norm. $D_{KL}(p\|q) = \sum p \log(p/q)$ denotes KL-divergence.
$D_{KL}(p\|q)$ is non-negative and the equality holds if and only if $p = q$. Our penalty encourages sparse
structure in the model because $\sum_k D_{KL}(\bar{\mu}_i \| \mu_{ki}) = 0$ if and only if $\mu_{ki} = \bar{\mu}_i$ for all $k \in [K]$, which
indicates that the conditional probability of the random variable $X_i$ is identical in the different
mixture components.
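As a small numerical illustration of the penalty in (4), the following sketch (ours, with hypothetical function names and the assumed layout `mu[k][i][r]`) computes the per-feature quantity $\sum_k D_{KL}(\bar{\mu}_i \| \mu_{ki})$ and counts the features for which it is non-zero. It assumes strictly positive conditional probabilities so that every logarithm is finite.

```python
import math

def kl(p, q):
    """D_KL(p || q) for discrete distributions, with the convention 0 log 0 = 0.

    Assumes q[r] > 0 wherever p[r] > 0.
    """
    return sum(pr * math.log(pr / qr) for pr, qr in zip(p, q) if pr > 0)

def informativeness(weights, mu, i):
    """sum_k D_KL(mu_bar_i || mu_ki) for feature i, with mu_bar_i = sum_k w_k mu_ki."""
    R = len(mu[0][i])
    mu_bar = [sum(w * mu_k[i][r] for w, mu_k in zip(weights, mu)) for r in range(R)]
    return sum(kl(mu_bar, mu_k[i]) for mu_k in mu)

def l0_penalty(weights, mu, tol=1e-12):
    """Number of features whose informativeness exceeds tol (tol is an assumption
    introduced here to absorb floating-point rounding)."""
    M = len(mu[0])
    return sum(1 for i in range(M) if informativeness(weights, mu, i) > tol)
```

A feature whose conditional distribution is the same in every component contributes zero, so the count equals the number of informative features.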
3 Methods


For convenience, assume that we can draw an infinite number of samples. To begin,

Definition 1. Let $f(X, Y; \Theta)$ be a MDPD with parameter $\Theta = \{(\omega_k, \mu_k) : k \in [K]\}$.
$X_i$ is informative if and only if $\sum_k D_{KL}(\bar{\mu}_i \| \mu_{ki}) > 0$.

It can be verified that if $X_i$ is uninformative, the marginal distribution $f(X)$ can be factorized
as

$$f(X) = f(X_{/i}) f(X_i), \quad (5)$$

where $X_{/i}$ denotes all coordinates of $X$ except $X_i$. $S \subseteq [M]$ denotes the index set of informative
variables. Due to the factorization (5),

$$f(Y|X) = f(Y|X_S). \quad (6)$$
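Equation (6) is easy to verify on a toy model. The sketch below is an illustration under our own assumptions (0-based labels, parameters stored as `mu[k][i][r]`, a hypothetical `subset` argument); it computes the posterior over components using either all features or only a chosen subset.

```python
import math

def posterior(x, weights, mu, subset=None):
    """f(Y | X_subset = x_subset) for an MDPD; subset=None uses every feature.

    Features outside `subset` are simply marginalized out, which for a product
    distribution means dropping their factor.
    """
    idx = range(len(x)) if subset is None else subset
    joint = [w * math.prod(mu_k[i][x[i]] for i in idx)
             for w, mu_k in zip(weights, mu)]
    total = sum(joint)
    return [j / total for j in joint]
```

When the features left out of `subset` have identical conditionals in every component, the two posteriors agree, which is the content of (6).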

Intuitively, uninformative variables will not affect the posterior distribution - they provide no
information about the underlying latent variable. Now, we analyze the penalized likelihood function
(4). The normal E-step leads to the penalized M-step

$$\max_\Theta \; \sum_X f_0(X) \sum_Y f(Y|X; \Theta^t) \log \frac{f(X, Y; \Theta)}{f(Y|X; \Theta^t)} - \lambda \sum_i \Big\| \sum_k D_{KL}(\bar{\mu}_i \| \mu_{ki}) \Big\|_0. \quad (7)$$

The penalized M-step encourages a sparse update for the model and provides a way to determine
$S$. By (6), it makes the E-step in the next iteration depend only on $X_S$. However, solving (7)
is hard, so we seek another way to determine $S$ and, in the process, bypass the penalized M-step
by using a regularized E-step, i.e. calculating $f(Y|X_S)$. The following theorem motivates how we
select $S$.

Theorem 1. Let $f_0(X)$ be a MDPD from which data are sampled and let $S$ be the informative
set of $f_0(X)$. If $f(X_i|Y) = f_0(X_i)$ for $i \in \bar{S}$, then

$$D_{KL}(f_0(X) \| f(X)) = D_{KL}(f_0(X_S) \| f(X_S)).$$

Maximizing the likelihood function is equivalent to minimizing the KL-divergence loss $D_{KL}(f_0(X) \| f(X))$,
since $D_{KL}(f_0(X) \| f(X)) = -H_{f_0}(X) - l(\Theta)$, where $H_{f_0}(X) = -\sum_X f_0(X) \log f_0(X)$ is the entropy of
$f_0(X)$. The KL-divergence loss can be viewed as a measure of how well the model estimates $f_0(X)$.
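For completeness, the identity follows in three lines from the definition of the KL-divergence and the definition (1) of $l(\Theta)$, assuming the usual sign convention $H_{f_0}(X) = -\sum_X f_0(X) \log f_0(X)$:

```latex
\begin{align*}
D_{KL}(f_0 \| f_\Theta)
  &= \sum_X f_0(X) \log \frac{f_0(X)}{f(X; \Theta)} \\
  &= \sum_X f_0(X) \log f_0(X) - \sum_X f_0(X) \log \sum_Y f(X, Y; \Theta) \\
  &= -H_{f_0}(X) - l(\Theta).
\end{align*}
```

Because $H_{f_0}(X)$ does not depend on $\Theta$, increasing $l(\Theta)$ by some amount decreases the KL-divergence loss by the same amount.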

Theorem 1 suggests that, if we have an appropriate model for uninformative features, $S$ could be
recovered by solving the following dimensionality reduction problem: find the smallest $S \subseteq [M]$
such that

$$D_{KL}(f_0(X) \| f(X)) = D_{KL}(f_0(X_S) \| f(X_S)).$$

Although in practice $S$ is unknown, it is easy to find a model $f(X, Y)$ that satisfies the condition
in the theorem. Simply pretend that all features are uninformative. Then $f(X, Y)$ is just a
one-component MDPD satisfying $f(X_i|Y) = f_0(X_i)$ for $i \in [M]$. We use this idea to initialize our
algorithm, which is distinct from the common practice of initializing mixture models with multiple
components and random parameters. But two problems arise. First, the one-component model
might not be a good one, since it does not capture any high-order interactions. It will have to be
split, and the procedure to do this is in Section 3.2. Second, $D_{KL}(f_0(X) \| f(X))$ is computationally
intractable. Our approach is to find a proper approximation to $D_{KL}(f_0(X) \| f(X))$, based on which
we can place variables into $S$. The details are as follows.
Figure 2. $U(f_0 \| f_{S,\Theta^t})$ is an upper bound of $D_{KL}(f_0 \| f_{\Theta^{t+1}})$ induced by EM. $l(\Theta^*)$ denotes
the global maximum of the log-likelihood. And $Q(\Theta, \Theta^t)$ is (3) with $f(Y|X; \Theta^t)$ replaced by
$f(Y|X_S; \Theta^t)$.


3.1 Conditional Mutual Information Approximates KL-Divergence Loss 

From now on, $D_{KL}(f_0(X) \| f(X))$ is referred to as $D_{KL}(f_0 \| f_\Theta)$. We first define a hybrid distribution.

Definition 2. The hybrid distribution is defined as

$$f_S(X, Y; \Theta) := f(Y|X_S; \Theta) f_0(X). \quad (8)$$

The hybrid distribution is a valid probability distribution as it is non-negative and sums to one.
By using the hybrid distribution, the following theorem gives an upper bound on $D_{KL}(f_0 \| f_\Theta)$.

Theorem 2$^1$. Let $f_S(X, Y; \Theta)$ be the hybrid distribution (8), $f(X, Y; \Theta^t)$ be the model distribution
at time $t$, and $f(X, Y; \Theta^{t+1})$ be the model distribution after one iteration of EM. Then,

$$D_{KL}(f_0 \| f_{\Theta^{t+1}}) \leq U(f_0 \| f_{S,\Theta^t})$$

where

$$U(f_0 \| f_{S,\Theta^t}) = \sum_{X,Y} f_S(X, Y; \Theta^t) \log \frac{f_S(X|Y; \Theta^t)}{\prod_i f_S(X_i|Y; \Theta^t)}.$$


The geometric interpretation of the theorem is provided in Figure 2. This theorem is a direct
result of Jensen’s inequality and the EM algorithm. By information theory,

$$U(f_0 \| f_{S,\Theta^t}) = \sum_{i \in [M]} H_{f_S,\Theta^t}(X_i|Y) - H_{f_S,\Theta^t}(X|Y). \quad (9)$$

The first term in (9) involves the singleton marginal conditional entropy $H_{f_S,\Theta^t}(X_i|Y)$, which is
computationally tractable. However, because $f_S(X|Y; \Theta) = f_0(X) f(Y|X_S; \Theta) / f_S(Y; \Theta)$ and $f_0(X)$ cannot be
factorized in most cases, the second term $H_{f_S,\Theta^t}(X|Y)$ is computationally intractable. To tackle
the intractability, we further approximate $H_{f_S,\Theta^t}(X|Y)$ with the Bethe entropy approximation.

Recall, in graphical models, $X = [X_i]$ are random variables associated with vertices $V$ and $f(X)$
is the joint distribution associated with the graph $G(V, E)$. The Bethe entropy approximation [13]
is defined as

$$H(X) \approx H_{Bethe} = \sum_{i \in V} H(X_i) - \sum_{(s,t) \in E} I(X_s, X_t)$$

where $I(X_s, X_t)$ is pairwise mutual information. The Bethe entropy approximation is accurate for
acyclic Markov random fields.
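The behaviour of the Bethe approximation on an acyclic graph can be illustrated on a three-variable Markov chain. The sketch below is ours; the chain parameters in `chain_joint` are hypothetical values chosen only for the example, and joint distributions are stored as dictionaries from outcome tuples to probabilities.

```python
import math
from itertools import product

def entropy(p):
    """Shannon entropy (in nats) of a distribution stored as {outcome: prob}."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def marginal(joint, axes):
    """Marginal of `joint` over the coordinates listed in `axes`."""
    out = {}
    for x, v in joint.items():
        key = tuple(x[a] for a in axes)
        out[key] = out.get(key, 0.0) + v
    return out

def mutual_information(joint, s, t):
    """I(X_s, X_t) = H(X_s) + H(X_t) - H(X_s, X_t)."""
    return (entropy(marginal(joint, [s])) + entropy(marginal(joint, [t]))
            - entropy(marginal(joint, [s, t])))

def bethe_entropy(joint, n_vars, edges):
    """Sum of singleton entropies minus pairwise mutual information on the edges."""
    singles = sum(entropy(marginal(joint, [i])) for i in range(n_vars))
    pairs = sum(mutual_information(joint, s, t) for s, t in edges)
    return singles - pairs

def chain_joint():
    """Hypothetical binary Markov chain X0 -> X1 -> X2 (illustrative numbers)."""
    p0 = [0.6, 0.4]
    p1 = [[0.9, 0.1], [0.2, 0.8]]   # p1[a][b] = P(X1 = b | X0 = a)
    p2 = [[0.7, 0.3], [0.4, 0.6]]   # p2[b][c] = P(X2 = c | X1 = b)
    return {(a, b, c): p0[a] * p1[a][b] * p2[b][c]
            for a, b, c in product(range(2), repeat=3)}
```

On the chain, `bethe_entropy(chain_joint(), 3, [(0, 1), (1, 2)])` matches `entropy(chain_joint())` up to rounding, while dropping the edges gives the larger sum of singleton entropies.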
Applying the Bethe entropy approximation to the second term in (9) yields an approximation
to the conditional entropy:

$$H_{f_S,\Theta^t}(X|Y) \approx \sum_i H_{f_S,\Theta^t}(X_i|Y) - \sum_{i \neq j} I_{f_S,\Theta^t}(X_i, X_j|Y) \quad (10)$$

where

$$I_{f_S,\Theta^t}(X_i, X_j|Y) = \sum_{X,Y} f_S(X, Y; \Theta^t) \log \frac{f_S(X_i, X_j|Y; \Theta^t)}{f_S(X_i|Y; \Theta^t) f_S(X_j|Y; \Theta^t)}.$$

Now, combine (9) and (10) to approximate an upper bound for the KL-divergence loss:

$$U(f_0 \| f_{S,\Theta^t}) \approx \sum_{i \neq j} I_{f_S,\Theta^t}(X_i, X_j|Y). \quad (11)$$

The approximation consists of pairwise conditional mutual information. It breaks the curse of
dimensionality for KL-divergence loss, and the computational complexity of $\sum_{i \neq j} I_{f_S,\Theta^t}(X_i, X_j|Y)$
is $O(KNM^2R^2)$. It leads to an operational version of Theorem 1:

Proposition 1. Under the same conditions as in Theorem 1, we have

$$\sum_{i,j \in [M]} I_{f_S,\Theta^t}(X_i, X_j|Y) = \sum_{i,j \in S} I_{f_S,\Theta^t}(X_i, X_j|Y). \quad (12)$$


Thus we can recover $S$ in a similar way to that suggested by Theorem 1. Moreover, if the model
fits the data perfectly, (11) would be zero.

Proposition 2. If $D_{KL}(f_0 \| f_{\Theta^t}) = 0$, then $I_{f_S,\Theta^t}(X_i, X_j|Y) = 0$.

In effect from Proposition 1, if $I_{f_S,\Theta^t}(X_i, X_j|Y)$ is large for some feature pair $(i, j)$, we can
conclude that both $i$ and $j$ are informative. On the other hand, from Proposition 2, the model
doesn’t fit the data well in those dimensions. Therefore, $i$ and $j$ are significant for model learning
and should be used to regularize the E-step. This is the key idea that underlies our algorithm.
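To make the criterion concrete, the following sketch estimates the pairwise conditional mutual information of the hybrid distribution from a finite sample. It is an illustration under our own assumptions: `post[n][k]` stands for the regularized posterior $f(Y = k|x_S^{(n)}; \Theta^t)$ computed elsewhere, labels are 0-based integers below `R`, and the function name is hypothetical.

```python
import math

def pairwise_cmi(data, post, i, j, R):
    """Return (I(X_i, X_j | Y), [I(X_i, X_j) | (Y = k) for each k]).

    The hybrid distribution puts mass post[n][k] / N on (x^(n), k), so each
    per-component pair table is a posterior-weighted contingency table.
    """
    N, K = len(data), len(post[0])
    per_component, total = [], 0.0
    for k in range(K):
        mass = sum(p[k] for p in post)            # equals N * f_S(Y = k)
        if mass == 0:
            per_component.append(0.0)             # empty component contributes nothing
            continue
        pair = [[0.0] * R for _ in range(R)]
        for x, p in zip(data, post):
            pair[x[i]][x[j]] += p[k] / mass       # f_S(X_i, X_j | Y = k)
        pi = [sum(row) for row in pair]
        pj = [sum(pair[a][b] for a in range(R)) for b in range(R)]
        mi = sum(pair[a][b] * math.log(pair[a][b] / (pi[a] * pj[b]))
                 for a in range(R) for b in range(R) if pair[a][b] > 0)
        per_component.append(mi)
        total += (mass / N) * mi
    return total, per_component
```

For two perfectly correlated binary features and a single component the value is $\log 2$; once the posterior separates the two patterns into different components it drops to zero, in line with Proposition 2.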

3.2 Algorithm: Stagewise EM 

Our main algorithm - stagewise EM - is now developed. Following convention (one-hot
encoding), let $x_i^{(n)} \in \{0, 1\}^R$ be an observation of coordinate $X_i$. The model is initialized as a
one-component MDPD such that the conditional distribution of each feature equals the
corresponding frequency in the observations. For uninformative features, this initialization is already a
good estimate.

Theorem 3$^1$. For finite observations, redefine $f_S(X, Y; \Theta) := f(Y|X_S; \Theta) f_0(X)$. The
regularized E-step is to calculate $f(Y|X_S = x_S^{(n)}; \Theta^t)$ based on the current model $f(X, Y; \Theta^t)$ and $S$. The
corresponding M-step is given by

$$\omega_k^{t+1} \leftarrow f_S(Y = k; \Theta^t)$$

$$\mu_{ki}^{t+1} \leftarrow f_S(X_i|Y = k; \Theta^t)$$
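One iteration of the regularized E-step and the M-step of Theorem 3 can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: labels are assumed to be 0-based integers instead of one-hot vectors, parameters are stored as `mu[k][i][r]`, and no smoothing is applied, so every component is assumed to receive non-zero posterior mass.

```python
import math

def regularized_em_step(data, weights, mu, S):
    """Regularized E-step on the features in S, then the M-step of Theorem 3.

    Returns (new_weights, new_mu, post) with post[n][k] = f(Y = k | x_S^(n)).
    """
    N, M = len(data), len(data[0])
    K, R = len(weights), len(mu[0][0])
    # Regularized E-step: only features in S enter the posterior.
    # With S empty, math.prod of an empty sequence is 1 and post equals the weights.
    post = []
    for x in data:
        joint = [weights[k] * math.prod(mu[k][i][x[i]] for i in S) for k in range(K)]
        total = sum(joint)
        post.append([v / total for v in joint])
    # M-step: weights and ALL M conditionals are read off the hybrid distribution.
    new_weights = [sum(p[k] for p in post) / N for k in range(K)]
    new_mu = [[[0.0] * R for _ in range(M)] for _ in range(K)]
    for x, p in zip(data, post):
        for k in range(K):
            for i in range(M):
                new_mu[k][i][x[i]] += p[k]
    for k in range(K):
        mass = new_weights[k] * N        # assumed non-zero (no smoothing here)
        for i in range(M):
            new_mu[k][i] = [v / mass for v in new_mu[k][i]]
    return new_weights, new_mu, post
```

With one component and an empty `S`, the step returns the empirical label frequencies of each feature, which is the initialization described above.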
Algorithm 1: stagewise EM
INPUT: $\{x^{(n)} : n \in [N]\}$ and $K_{target}$
OUTPUT: $\{\omega_k, \mu_k : k \in [K_{target}]\}$

Initialization: $K = 1$, $\omega_1 = 1$ and $\mu_{1i} = \frac{1}{N} \sum_{n \in [N]} x_i^{(n)}$
while not converged do
    Calculate $I_{f_S,\Theta^t}(X_i, X_j)|(Y = k)$ for the current model.
    Find the biggest entry $(i, j, k)$ in $I_{f_S,\Theta^t}(X_i, X_j)|(Y = k)$.
    if $i \notin S$ or $j \notin S$ then
        Add $i$ and $j$ into $S$
        if $K < K_{target}$ then
            Duplicate the $k$-th component and perturb in the $i$, $j$ coordinates (explained in context).
        end if
    end if
    Perform the regularized E-step and M-step in Theorem 3
end while


Thus stagewise EM iteratively performs a regularized E-step followed by a corresponding M-step.
But, to regularize the E-step, the informative set $S$ has to be obtained explicitly, which we
do in an interlaced fashion together with the EM iterations. Specifically, since at least one EM
iteration is needed for each update of $S$, the algorithm works conservatively and attempts to update
$S$ after each iteration of EM.

We now develop the update for the informative set $S$. By a standard result in information theory,

$$I_{f_S,\Theta^t}(X_i, X_j|Y) = \sum_k f_S(Y = k; \Theta^t) \, I_{f_S,\Theta^t}(X_i, X_j)|(Y = k) \quad (13)$$

where

$$I_{f_S,\Theta^t}(X_i, X_j)|(Y = k) = \sum_{X_i,X_j} f_S(X_i, X_j|Y = k; \Theta^t) \log \frac{f_S(X_i, X_j|Y = k; \Theta^t)}{f_S(X_i|Y = k; \Theta^t) f_S(X_j|Y = k; \Theta^t)}.$$

The informative set $S$ is updated by picking the biggest triplet $(i, j, k)$ in $I_{f_S,\Theta^t}(X_i, X_j)|(Y = k)$ and adding the related
indices, $i$ and $j$, into $S$ (if they are not already in $S$). The stagewise update is a strong regularization
on EM, as it enforces EM on the features that are informative and have not been fitted well in the
current model. We use the word “stagewise” because a similar idea has been applied to regression
[6, 8].
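The update of $S$ amounts to an argmax over triplets followed by a membership test. A minimal sketch, assuming the per-component quantities have already been computed and stored in a hypothetical nested list `cmi[k][i][j]` (diagonal entries are ignored, and at least two features are assumed):

```python
def update_informative_set(cmi, S):
    """Pick the largest off-diagonal triplet (i, j, k) and grow S if needed.

    cmi[k][i][j] is assumed to hold I(X_i, X_j) | (Y = k); S is an ordered list.
    Returns (new_S, (i, j, k), changed). Ties are broken by first occurrence.
    """
    best, triplet = float("-inf"), None
    for k, table in enumerate(cmi):
        for i, row in enumerate(table):
            for j, value in enumerate(row):
                if i != j and value > best:
                    best, triplet = value, (i, j, k)
    i, j, k = triplet
    added = [v for v in (i, j) if v not in S]
    return S + added, triplet, bool(added)
```

The returned flag `changed` corresponds to the condition under which Algorithm 1 considers duplicating component `k`.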

An important detail remains. Since the algorithm is initialized as a one-component model, or
during iterations, we may need to increase the number of components. We do this by splitting one,
and again act conservatively: First find the largest triplet $(i, j, k)$, duplicate the $k$-th component,
and add it into the mixture model: $\mu_{new,i} \leftarrow \mu_{ki}$ for $i \in [M]$ and $\omega_{new}, \omega_k \leftarrow 0.5\,\omega_k$.


Theorem 4$^1$. Let $f(X, Y; \Theta^{old})$ be the model before duplication and $f(X, Y; \Theta^{new})$ the one after
duplication. It can be shown that

$$\sum_{i,j} I_{f_S,\Theta^{old}}(X_i, X_j|Y) = \sum_{i,j} I_{f_S,\Theta^{new}}(X_i, X_j|Y).$$
Intuitively, since the duplication does not alter the marginal distribution $f(X; \Theta)$, the
KL-divergence loss remains unchanged. Theorem 4 indicates that $\sum_{i,j} I_{f_S,\Theta}(X_i, X_j|Y)$ also remains
the same. To break the symmetry between the $k$-th component and the new component, we freeze
all parameters $\Theta$ in $I_{f_S,\Theta}(X_i, X_j|Y)$ except for $\mu_{ki}$, $\mu_{kj}$, $\mu_{new,i}$ and $\mu_{new,j}$, and perturb the
model with regard to the free parameters. Due to the symmetry, it can be shown that $\Theta^{new}$
(parameters after duplication) is a saddle point of the restricted function. Therefore, we calculate
the Hessian of $I_{f_S,\Theta}(X_i, X_j|Y)$ with regard to the free parameters and perturb in the direction
of the eigenvector with the most negative eigenvalue of the Hessian.
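The claim that duplication leaves the marginal distribution unchanged can be checked directly. The sketch below is our own illustration of the duplication step only (the Hessian-based perturbation is not shown); it assumes the weight of the duplicated component is split evenly, with parameters stored as `mu[k][i][r]`.

```python
import math
from itertools import product

def marginal(x, weights, mu):
    """f(x; Theta) = sum_k w_k prod_i mu[k][i][x_i]."""
    return sum(w * math.prod(mu_k[i][xi] for i, xi in enumerate(x))
               for w, mu_k in zip(weights, mu))

def duplicate_component(weights, mu, k):
    """Copy component k and split its weight evenly between the two copies.

    Returns new lists; the inputs are left unmodified. The copy is appended
    as the last component.
    """
    new_weights = list(weights)
    new_weights[k] = 0.5 * weights[k]
    new_weights.append(0.5 * weights[k])
    new_mu = [[list(row) for row in mu_k] for mu_k in mu]
    new_mu.append([list(row) for row in mu[k]])
    return new_weights, new_mu
```

Evaluating `marginal` before and after `duplicate_component` on every outcome gives the same values up to rounding, so the log-likelihood is unaffected until the copies are perturbed apart.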

4 Empirical Studies 

Following [5], model-based learning in crowdsourcing can be viewed as a special case of MDPD.
Now $X_i \in [R]$ is the label given by the $i$-th worker ($i \in [M]$) to an item with true label denoted by
$Y \in [K]$; this requires $R = K$. The goal is to estimate the true label for each item and to assess
the individual workers’ performances. All the examples below are $\alpha$-sparse crowdsourcing, in which
only $\lceil \alpha M \rceil$ workers give the true label with some probability; the other workers give random labels
with unknown probability. As we show, stagewise EM performs well against the state-of-the-art
crowdsourcing algorithms.

We first study the behavior of the informative set $S$ using a simulation of 0.3-sparse MDPD
$f_0(X)$ with 3 components ($K = 3$). 100 workers provide labels to 1000 items with the 30 informative
(“expert”) workers enjoying decreasing capabilities: the first worker provides the true label with
probability 0.7 and the 30th worker with probability 0.45. The rest of the workers are random.
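A data set of this kind can be simulated in a few lines. The sketch below is a hedged reconstruction and not the generator used for the reported results; several details are assumptions on our part and are marked as such in the comments: expert accuracies are interpolated linearly between the first and last expert, true labels are uniform, an expert who errs picks uniformly among the wrong labels, and non-expert workers label uniformly at random.

```python
import random

def simulate_sparse_crowd(n_items=1000, n_workers=100, n_experts=30,
                          n_labels=3, p_first=0.7, p_last=0.45, seed=0):
    """Return (truth, labels, acc) for an alpha-sparse crowdsourcing simulation.

    labels[n][w] is the 0-based label given by worker w to item n.
    Requires n_experts >= 2 because of the interpolation below.
    """
    rng = random.Random(seed)
    # ASSUMPTION: accuracies decay linearly from p_first to p_last.
    acc = [p_first + (p_last - p_first) * e / (n_experts - 1)
           for e in range(n_experts)]
    truth, labels = [], []
    for _ in range(n_items):
        y = rng.randrange(n_labels)              # ASSUMPTION: uniform true labels
        row = []
        for w in range(n_workers):
            if w < n_experts and rng.random() < acc[w]:
                row.append(y)                    # expert answers correctly
            elif w < n_experts:
                # ASSUMPTION: errors are uniform over the wrong labels
                row.append(rng.choice([r for r in range(n_labels) if r != y]))
            else:
                row.append(rng.randrange(n_labels))  # ASSUMPTION: uniform noise
        truth.append(y)
        labels.append(row)
    return truth, labels, acc
```

The defaults mirror the setting described above (100 workers, 1000 items, 30 experts, 3 labels); the returned `labels` can be fed to any of the estimators discussed in this section.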



Figure 3. The performance of stagewise EM on 0.3-sparse MDPD: from the left to the right are 
log-likelihood, max norm of conditional mutual information, and the size of the informative set S 
against the number of iterations. Dashed lines are benchmarks obtained from the underlying true 
distribution. 

We perform 20 iterations of stagewise EM (Figure 3). The benchmark log-likelihood is given by
$\sum_X f_0(X) \log f_0(X)$, and the benchmark conditional mutual information
(middle panel) is also obtained with the true distribution and training data. According to
Proposition 2, the max norm of the conditional mutual information evolves toward 0. The algorithm
converges within 10 iterations with $|S| < 10$. A more detailed look at the mutual information
criterion (11) for dimensionality reduction is given in Figure 4. By construction, the
workers’ capabilities decay as the index increases. Note how stagewise EM rapidly identifies the top
5 most informative workers. The full $S$ for this task is (in order) [1, 3, 2, 0, 4, 11, 6, 15, 48, 77]. The
first 8 in $S$ are all top-15 workers and, from Figure 3, the algorithm has almost converged at that point;
the later members of $S$ are not important. This further suggests that the algorithm is able
to “decide” how much information is needed to learn the model; although 30 informative workers
exist, in practice fewer than 10 are needed for a good estimate.

[Figure 4 panels: t = 0, t = 3, t = 5, t = 7; axis: worker #; S = [1,3], S = [1,3,2], S = [1,3,2,0,4], S = [1,3,2,0,4,11]]

Figure 4. Illustration of how $S$ evolves and regularizes EM. The diagonal entries are of no interest
and are therefore omitted. The leftmost panel shows the conditional mutual information $I(X_i, X_j)$
at t = 0. The next four top panels show $I_{f_S,\Theta^t}(X_i, X_j|Y)$ for the first 30 workers at iteration
t = 0, 3, 5, 7, while the bottom panels present the mutual information among workers in $S$. For each
iteration, the informative set $S$ regularizes the E-step.

$^1$ See proofs in the supplement.

We now study the prediction performance by comparing our algorithm (stage-EM) against
the spectral algorithm (Spec-EM) [18] and the majority-vote-initialized EM (MV-EM) commonly
used in practice. We start with synthetic data for the $\alpha$-sparse crowdsourcing problem with 100
workers, 1000 samples, and 3 labels. The first $100\alpha$ workers are informative, giving correct labels
with probability 0.6; the rest are uninformative workers giving labels at random with unknown
probability. We vary $\alpha \in [0.05, 0.2]$ and, for each $\alpha$, the experiment is repeated 10 times. In
Figure 5, we show prediction performances achieved by different algorithms. The benchmark score
is the prediction error by the true model. Spec-EM does not work in this sparse setting, MV-EM
is able to keep up with our algorithm until $\alpha$ becomes small, while stage-EM stays close to the
benchmark for all $\alpha$. Our algorithm consistently outperforms the other methods in this sparse
setting.

Finally, we turn to real datasets (Table 1), even though they are not sparse. The bluebird 
dataset [15] is a binary labeling task containing 108 items, 39 workers and 4,212 observed labels. 
The dog dataset [19] contains 4 different dog breed labels from ImageNet. Since these datasets are 
incomplete, we add a new “missing label” which indicates that the worker does not label this item. 
The probability of not giving a label is assumed to be independent of the true label. It is estimated 
from the data for each worker and then frozen (not trainable) during model fitting. Since these 
datasets are not sparse, we run regular EM on the complete dataset after model fitting to leverage 
all the information (StageEM-refine). As shown in Table 1, it is still comparable with MV-EM 
and surpasses Spec-EM. Importantly, stagewise EM has decent prediction performance using only 
about 1/3 of the workers available in both datasets. 
Table 1. Prediction error (%) on real datasets. $|S|$ is the size of the informative set,
while $M$ is the total number of workers.

Algorithm            Bird (2 labels)    Dog (4 labels)
StageEM-refine       10.19              16.73
StageEM              12.04              20.69
  ($|S|/M$)          (11/39)            (14/52)
Spec-EM (low)        11.11              16.98
Spec-EM (average)    11.57              22.19
Spec-EM (high)       12.04              31.85
MV-EM$^a$            11.11              16.66

$^a$ Refer to the results reported in [18].

Figure 5. Prediction performance on the 3-label $\alpha$-sparse crowdsourcing model. For each
$\alpha$, experiments are repeated 10 times. The shaded error bar shows the best and the worst
performance.

5 Conclusion


We developed a stagewise EM algorithm for sparse clustering of discretely-valued data. The key 
insight is that uninformative features should have uniform probability of belonging to any mixture 
class. This led to an informative set of features via a mutual information criterion and a practical 
algorithm by approximating it with Bethe entropy. The result performed well for neurosciences 
and crowdsourcing datasets. 
Supplements

Proof: Theorem 1

Since $S$ is the informative set, by (5),

$$f_0(X) = f_0(X_S) \prod_{i \in \bar{S}} f_0(X_i).$$

In the theorem, we set $f(X_i|Y) = f_0(X_i)$ for $i \in \bar{S}$. Hence, we also have

$$f(X) = f(X_S) \prod_{i \in \bar{S}} f(X_i) = f(X_S) \prod_{i \in \bar{S}} f_0(X_i).$$

For KL-divergence,

$$D_{KL}(f_0(X) \| f(X)) = \sum_X f_0(X) \log \frac{f_0(X)}{f(X)}
= \sum_X f_0(X) \log \frac{f_0(X_S) \prod_{i \in \bar{S}} f_0(X_i)}{f(X_S) \prod_{i \in \bar{S}} f(X_i)}
= \sum_X f_0(X) \log \frac{f_0(X_S)}{f(X_S)}
= D_{KL}(f_0(X_S) \| f(X_S)).$$


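Proof: Theorem 2

$$D_{KL}(f_0 \| f_{\Theta^{t+1}}) = -H_{f_0}(X) - l(\Theta^{t+1})$$

By the standard result from EM, $l(\Theta^{t+1})$ is lower bounded:

$$l(\Theta^{t+1}) \geq \max_\Theta Q(\Theta; \Theta^t)$$

where $Q(\Theta; \Theta^t) = \sum_X f_0(X) \sum_Y f(Y|X_S; \Theta^t) \log \frac{f(X, Y; \Theta)}{f(Y|X_S; \Theta^t)}$. Let $f_S(X, Y; \Theta)$ be defined as (8) and
by Theorem 3,

$$\max_\Theta Q(\Theta; \Theta^t) = \sum_{X,Y} f_S(X, Y; \Theta^t) \log \frac{\prod_{i \in [M]} f_S(X_i|Y; \Theta^t) \, f_S(Y; \Theta^t)}{f(Y|X_S; \Theta^t)}.$$

Using the lower bound of $l(\Theta^{t+1})$ to upper bound $D_{KL}(f_0 \| f_{\Theta^{t+1}})$ gives the desired result.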

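Proof: Theorem 3

In the finite observation case, after a regularized E-step, the M-step is to maximize

$$\sum_n \sum_{k \in [K]} f(y = k|x_S^{(n)}; \Theta^t) \log f(x^{(n)}, y = k; \Theta).$$

By the factorization of MDPD, it can be written as

$$\sum_n \sum_{k \in [K]} f(y = k|x_S^{(n)}; \Theta^t) \Big( \sum_{i \in [M]} \log f(x_i^{(n)}|k; \Theta) + \log f(y = k; \Theta) \Big).$$

By the parameterization of MDPD, the discrete distribution $f(x_i|k)$ is represented by $\mu_{ki} =
[\mu_{ki1}, \ldots, \mu_{kiR}]^T$ and $f(y)$ by $\omega = [\omega_1, \ldots, \omega_K]$, which satisfy

$$\sum_r \mu_{kir} = 1 \quad \text{and} \quad \sum_k \omega_k = 1.$$

Therefore, we can maximize over different $\mu_{ki}$ and $\omega$ separately. This constrained optimization
problem is solved by applying Lagrange multipliers, which gives

$$\mu_{kir} = \frac{\frac{1}{N} \sum_n f(y = k|x_S^{(n)}; \Theta^t) \, \mathbb{1}(x_i^{(n)} = r)}{\sum_{r'} \frac{1}{N} \sum_n f(y = k|x_S^{(n)}; \Theta^t) \, \mathbb{1}(x_i^{(n)} = r')}$$

$$\omega_k = \frac{\frac{1}{N} \sum_n f(y = k|x_S^{(n)}; \Theta^t)}{\sum_{k'} \frac{1}{N} \sum_n f(y = k'|x_S^{(n)}; \Theta^t)}$$

where $\mathbb{1}$ is the indicator function. We define $f_S(X, Y; \Theta) := f(Y|X_S; \Theta) f_0(X)$, therefore the probability for
any instance $(x, y)$ is

$$f_S(X = x, Y = y; \Theta) = \frac{1}{N} \sum_n f(y|x_S^{(n)}; \Theta) \, \mathbb{1}(x^{(n)} = x).$$

And we can reformulate the results as

$$\omega_k^{t+1} \leftarrow f_{S,\Theta^t}(Y = k)$$

$$\mu_{kir}^{t+1} \leftarrow f_{S,\Theta^t}(x_i = r|Y = k)$$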

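Proof: Theorem 4

It is enough to prove that, for all pairs $(i, j)$, $I_{f_S,\Theta^{new}}(X_i, X_j|Y) = I_{f_S,\Theta^{old}}(X_i, X_j|Y)$.

The conditional mutual information can be decomposed as

$$I_{f_S,\Theta}(X_i, X_j|Y) = \sum_k f_S(Y = k; \Theta) \, I_{f_S,\Theta}(X_i, X_j)|(Y = k).$$

Let the component $k_2$ be the duplication of the component $k_1$. We notice that

$$f_S(Y = k; \Theta^{new}) = \begin{cases} f_S(Y = k; \Theta^{old}), & \text{if } k \notin \{k_1, k_2\} \\ \tfrac{1}{2} f_S(Y = k_1; \Theta^{old}), & \text{if } k \in \{k_1, k_2\} \end{cases} \quad (14)$$

Moreover, duplication does not change the conditional distribution within each component, so we
have

$$I_{f_S,\Theta^{new}}(X_i, X_j)|(Y = k) = \begin{cases} I_{f_S,\Theta^{old}}(X_i, X_j)|(Y = k), & \text{if } k \notin \{k_1, k_2\} \\ I_{f_S,\Theta^{old}}(X_i, X_j)|(Y = k_1), & \text{if } k \in \{k_1, k_2\} \end{cases}$$

Substituting (14) and this identity into the decomposition, the components $k_1$ and $k_2$ together contribute
exactly what $k_1$ contributed before duplication. Therefore, the claimed equality holds.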
Proof: Proposition 1

The proposition follows from the fact that if $i \in \bar{S}$, then

$$f_{S,\Theta^t}(X, Y) = f(Y|X_S; \Theta^t) f_0(X_{/i}) f_0(X_i).$$

The only term containing $X_i$ is $f_0(X_i)$. Therefore, $X_i$ is independent of other features in the hybrid
distribution, which leads to the proposition.

Proof: Proposition 2

If $D_{KL}(f_0 \| f_{\Theta^t}) = 0$, then $f(X; \Theta^t) = f_0(X)$. And the hybrid distribution becomes

$$f_{S,\Theta^t}(X, Y) = f(Y|X_S; \Theta^t) f(X; \Theta^t).$$

By Proposition 1, it is enough to show that $I_{f_S,\Theta^t}(X_i, X_j|Y) = 0$ for $\{(i, j) \mid i, j \in S \text{ and } i \neq j\}$.
This is true because $X_i$ and $X_j$ are independent conditional on $Y$.